feat(core): RAG retrieval overhaul (RRF + IVF_PQ), redaction hardening, and test coverage #46

Merged
jb-thery merged 6 commits into develop from feature/rag-retrieval-security-overhaul on Jul 3, 2026

Conversation

@jb-thery commented Jul 3, 2026

Member

Summary

RAG retrieval, redaction, config, and test-coverage overhaul driven by the deep audit. Split into 6 logical conventional commits.

Changes

  • chore: dist/ is now gitignored build output (built locally with pnpm build); dropped the CI committed-dist check.
  • feat(query): hybrid retrieval fuses vector + BM25 with weighted Reciprocal Rank Fusion (rank-only, no score calibration needed). Vector weight 0.7 / lexical 0.3. Recall stays 1.0 on the golden set.
  • feat(store): trains an IVF_PQ vector index automatically once the corpus reaches 256 rows (numPartitions ≈ √rows, clamped), keeping flat scan for small corpora. Unblocks query scalability beyond brute force.
  • fix(redaction): Luhn verification on credit-card numbers (no more over-redacting non-card digit runs), URL username now redacted alongside the password, plus Stripe / GitLab / Bearer providers and a verify: "luhn" opt-in on patterns.
  • feat(core): strict config schema (typos rejected), stderr warnings on invalid env overrides, access-log retention (trims past 10 MB), bounded LRU Transformers cache + clearTransformersCache(), and CLI option parsers extracted to a testable cli-options.ts.
  • test: suite 132 → 151 cases / 23 files — destroy, ask(), store manifest, embeddings, ingest --rebuild, config strict, access-log rotation, evaluate miss, redaction adversarial, CLI parsers, text tokenization.

Confidentiality posture verified

security-audit on the real monorepo index (681 chunks): zeroTelemetry=true, llmGeneration=false, transformersAllowRemoteModels=false, redactionEnabled=true, storageGitIgnored=true.

Checklist

  • pnpm lint clean
  • pnpm check clean
  • pnpm test — 151/151
  • pnpm build clean
  • pnpm smoke green (CLI + MCP 8 tools + license-webhook + release preflight)
  • commitlint 0 problems on all 6 commits

jb-thery added 6 commits July 4, 2026 00:26
Move all packages/*/dist/ directories from committed artifacts to gitignored
build output. dist/ is regenerated locally with `pnpm build` before running the
CLI, MCP smoke, the library-API demo, or `pnpm validate`.

- .gitignore: ignore ragmir-core/dist, ragmir-tts/dist (already ignored for
  app/landing/license-webhook); add *dist catch-all.
- ci.yml: drop the `git diff --exit-code -- dist` step that enforced committed
  dist, since dist is no longer tracked.
- AGENTS.md, CLAUDE.md, README.md, library-api-demo README: document that dist
  is gitignored and must be built locally; warn against `npx ragmir` for local
  testing (resolves the published npm package, not the working copy).
Replace the weighted-sum fusion (vector and BM25 scores divided by their max)
with Reciprocal Rank Fusion, the standard hybrid-retrieval approach. Each
candidate scores `weight / (RRF_K + rank)` per retriever it appears in, summed
across retrievers, so the BM25 and vector score distributions never need
calibration against each other.

The vector retriever is weighted higher (0.7) than the lexical one (0.3)
because, with the default local-hash embeddings, vector proximity is the more
discriminative signal on small corpora; the lexical weight still lets
exact-keyword evidence pull in candidates the vector retriever missed.

- RRF_K = 60 (Cormack et al. 2009 constant).
- Remove the now-unused weighted-sum helpers (vectorScore, normalizeScore) and
  the normalizeForMatch import left dead by the refactor.

Retrieval recall stays at 1.0 on the sovereign-rag-demo golden set.
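
Illustrative sketch of the weighted RRF scoring described in this commit message; it is not the code from the commit. Only `RRF_K = 60`, the 0.7 / 0.3 weights, and the `weight / (RRF_K + rank)` formula come from the message. The names `RankedList` and `fuseWithRrf`, the 1-based rank convention, and the sample ids are assumptions made for the example.

```typescript
// Constant quoted in the commit message (Cormack et al. 2009).
const RRF_K = 60;

// Hypothetical shape: one retriever's result ids, best match first.
interface RankedList {
  weight: number;
  ids: string[];
}

// Sum weight / (RRF_K + rank) over every retriever that returned the id.
// Only ranks are used, so raw BM25 and vector scores never meet.
function fuseWithRrf(lists: RankedList[]): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const { weight, ids } of lists) {
    ids.forEach((id, index) => {
      const rank = index + 1; // assumption: ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + weight / (RRF_K + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

const fused = fuseWithRrf([
  { weight: 0.7, ids: ["a", "b", "c"] }, // vector retriever
  { weight: 0.3, ids: ["d", "a"] }, // BM25 retriever
]);
console.log(fused.map((c) => c.id)); // "a" is first: it appears in both lists
```

In this toy input, "d" is found only by the lexical retriever and still enters the fused list, which is the behaviour the 0.3 lexical weight is meant to preserve.
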
Above a 256-row threshold, automatically create an IVF_PQ index on the vector
column after writing the table. Below the threshold, LanceDB keeps using an
exact flat scan, which is optimal for small corpora and avoids wasted index-
training work.

- numPartitions ≈ sqrt(rowCount), clamped to [8, 1024] (LanceDB production
  heuristic).
- numSubVectors = 16 (divides the 384-dim local-hash/mxbai-xsmall vectors).
- index creation is idempotent (skipped if vector_idx exists) and best-effort
  (a training failure on edge-case dimensionality leaves the table usable via
  flat scan rather than failing the ingest).

This unblocks query scalability beyond brute-force scan without changing the
overwrite write path.
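
Illustrative sketch of the index-sizing rule above, kept as a pure function so it makes no claim about the LanceDB calls the commit actually uses. The 256-row threshold, the [8, 1024] clamp, and `numSubVectors = 16` come from the message. The name `planVectorIndex`, the use of `Math.round` for "≈", and treating exactly 256 rows as indexed (the PR summary says "reaches 256 rows") are assumptions.

```typescript
const INDEX_ROW_THRESHOLD = 256; // from the commit message
const NUM_SUB_VECTORS = 16; // from the commit message; 384 / 16 = 24 dims per sub-vector
const MIN_PARTITIONS = 8;
const MAX_PARTITIONS = 1024;

// Hypothetical result shape for this sketch.
interface IvfPqPlan {
  numPartitions: number;
  numSubVectors: number;
}

// Returns null when the corpus is small enough to stay on the exact flat scan.
function planVectorIndex(rowCount: number): IvfPqPlan | null {
  if (rowCount < INDEX_ROW_THRESHOLD) return null;
  const approx = Math.round(Math.sqrt(rowCount)); // assumption: round to nearest
  const numPartitions = Math.min(MAX_PARTITIONS, Math.max(MIN_PARTITIONS, approx));
  return { numPartitions, numSubVectors: NUM_SUB_VECTORS };
}

console.log(planVectorIndex(100)); // null: flat scan
console.log(planVectorIndex(681)); // sqrt(681) is about 26.1, so 26 partitions
console.log(planVectorIndex(2_000_000)); // sqrt is about 1414, clamped to 1024
```
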
Close two confidentiality gaps and broaden provider coverage in the built-in
redaction patterns:

- credit_card: add a match-then-verify Luhn check (new RedactionPattern.verify
  field). Numeric runs that are not valid card numbers (version numbers,
  account IDs, hex runs) are left untouched instead of being over-redacted.
- url_credentials: extend the pattern so both the username and the password are
  redacted. Previously only the password was stripped, leaking the username.
- Add Stripe secret keys (sk_live/rk_live/sk_test), GitLab tokens (glpat-), and
  generic Bearer tokens. Order the more specific patterns before the generic
  api_token so they win on overlap.
- Add an optional `verify: "luhn"` to the RedactionPattern type so custom
  patterns can opt into the same check.
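
Illustrative sketch of the match-then-verify idea for `credit_card`; it is not the pattern or code shipped in this commit. Only the `verify: "luhn"` field is taken from the message. The `PatternSketch` shape, the regex, the 13 to 19 digit bound, and the `[REDACTED:…]` placeholder are assumptions for the example.

```typescript
// Standard Luhn checksum: from the right, double every second digit,
// subtract 9 from any result above 9, and require the sum to end in 0.
function passesLuhn(candidate: string): boolean {
  const digits = candidate.replace(/[\s-]/g, "");
  if (!/^\d{13,19}$/.test(digits)) return false; // assumed card-length bound
  let sum = 0;
  let double = false;
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = digits.charCodeAt(i) - 48;
    if (double) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    double = !double;
  }
  return sum % 10 === 0;
}

// Hypothetical pattern shape; only `verify` mirrors the commit message.
interface PatternSketch {
  name: string;
  regex: RegExp;
  verify?: "luhn";
}

// Redact a regex match only if it also passes the optional verifier.
function redact(text: string, pattern: PatternSketch): string {
  return text.replace(pattern.regex, (match) =>
    pattern.verify === "luhn" && !passesLuhn(match)
      ? match
      : `[REDACTED:${pattern.name}]`,
  );
}

const creditCard: PatternSketch = {
  name: "credit_card",
  regex: /\b(?:\d[ -]?){12,18}\d\b/g,
  verify: "luhn",
};

const sample = "card 4242 4242 4242 4242, build 1234567890123456";
console.log(redact(sample, creditCard));
// The first number passes Luhn and is redacted; the 16-digit build id fails and is kept.
```
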
…d use

Several additive robustness and observability improvements, plus extraction of
the CLI option parsers into a testable module:

- config: make rawConfigSchema strict so unknown keys (typos) are rejected
  instead of silently ignored; warn on stderr when an env override (e.g.
  RAGMIR_TOP_K=abc) is invalid so operators notice a no-op override.
- access-log: bound the log growth with a soft cap. When the file exceeds
  10 MB, trim it to the most recent 50 000 lines before the next append, so a
  long-lived MCP server cannot grow it without limit or OOM a usage report.
- embeddings: bound the Transformers.js pipeline cache to 3 entries with LRU
  eviction, and export clearTransformersCache(). destroyIndex now calls it so a
  re-ingest with a different embedding config does not pin stale ONNX weights.
- cli-options: extract the pure option parsers (parsePositiveInt, parseNumber,
  parseRecallThreshold, audioEngine, audioAllowRemoteModels, audioLanguage,
  parseAgentInstallScope, parseAgentInstallMode) into a dedicated module so
  they can be unit-tested without importing commander. cli.ts imports them.
  parsePositiveInt now rejects fractional input like "1.5" instead of silently
  truncating via parseInt.
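
Illustrative sketch of two items from the list above: a strict positive-integer parser and a small bounded LRU cache. Neither is the code from this commit. `parsePositiveIntSketch` and `BoundedLruCache` are hypothetical names, and the real signatures in `cli-options.ts` and the embeddings module may differ; only the "reject 1.5" behaviour and the 3-entry bound come from the message.

```typescript
// Digits-only check: "1.5", "2abc", and "-3" are rejected up front,
// whereas Number.parseInt("1.5", 10) would quietly return 1.
function parsePositiveIntSketch(raw: string): number {
  const value = raw.trim();
  if (!/^\d+$/.test(value)) {
    throw new Error(`Expected a positive integer, got "${raw}"`);
  }
  const parsed = Number(value);
  if (!Number.isSafeInteger(parsed) || parsed < 1) {
    throw new Error(`Expected a positive integer, got "${raw}"`);
  }
  return parsed;
}

// A Map iterates in insertion order, so re-inserting on every read keeps
// the least recently used key at the front, ready for eviction.
class BoundedLruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    this.entries.delete(key);
    this.entries.set(key, value); // mark as most recently used
    return value;
  }

  set(key: string, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  clear(): void {
    this.entries.clear();
  }

  keys(): string[] {
    return Array.from(this.entries.keys());
  }
}

const cache = new BoundedLruCache<string>(3); // 3 entries, as in the commit message
cache.set("a", "A");
cache.set("b", "B");
cache.set("c", "C");
cache.get("a"); // "a" becomes most recently used
cache.set("d", "D"); // evicts "b", the least recently used key
console.log(cache.keys());
console.log(parsePositiveIntSketch("8"));
```
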
Close the test-coverage gaps the audit identified, raising the suite from 132
to 151 cases across 23 files:

- destroy.test.ts (new): destroyIndex removed flag and access-log entry.
- query.test.ts: ask() empty-sources and populated cited-retrieval branches.
- store.test.ts: empty-text-files manifest round-trip, removal on empty,
  missing, malformed, and malformed-entry filtering; writeRows zero-rows
  dropTable and full re-write.
- embeddings.test.ts: embedTexts([]) early return and clearTransformersCache.
- ingest.test.ts: --rebuild forces a full re-index (reusedFiles === 0).
- config.test.ts: strict() rejects unknown keys; non-object config rejected.
- access-log.test.ts: retention trims past 10 MB; disabled logging writes
  nothing.
- evaluate.test.ts: miss case (hit=false, bestRank=null, recall=0).
- redaction.test.ts: Luhn pass/fail, URL username redacted, Stripe/GitLab/
  bearer providers, obfuscation limitation documented.
- cli.test.ts (new): all cli-options parsers incl. the MP3-without-engine
  confidentiality guard and agent scope/mode validation.
- text.test.ts (new): tokenize/normalizeForMatch (the BM25 foundation).
@jb-thery merged commit c6253d2 into develop Jul 3, 2026
7 checks passed
@jb-thery deleted the feature/rag-retrieval-security-overhaul branch July 3, 2026 17:37
github-actions Bot commented Jul 3, 2026

🎉 This PR is included in version 2.1.0 🎉

The release is available on:

Your semantic-release bot 📦🚀
